# Replication materials for The causes and consequences of COVID-19 misperceptions

This archive contains all the files and data needed to replicate *The causes and consequences of COVID-19 misperceptions*.

This replication requires both R and Stata. The authors used R 3.6.1 and Stata 2013 with updated packages as of 06/17/2020.

The replication materials consist of this README, the data, and two scripts. The survey data are in survey.dta. The social media and news media data are in four files: for each source, one file holds the full dataset and one holds the manually classified content.

* survey.dta: relevant variables from survey instrument used.
* survey_analysis.do: Stata script which recodes and analyzes the survey data.
* main.Rmd: the only file that needs to be run. It analyzes the Twitter and news data and runs the .do script. To run the .do script, the replicator must point R to a local Stata executable. The tidyverse, binom, and RStata packages are required.
* news_classified.rds: this contains the 1127 manually classified news stories that concerned the four themes of misinformation described in the paper.
* news.rds: this contains the full 8857 news articles that contain COVID-related information. Full text is not included, but each article's title and news site are provided so that individual articles can be looked up.
* tweets.rds: the full 6.85M tweets from the pre-, during-, and post-periods. status_id is included, but Twitter's terms of service prohibit sharing the full tweet data.
* tweets_classified.rds: a manually classified set of 2000 tweets, 500 each for the four categories of misinformation described in the paper.
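
As a rough illustration of pointing R to a local Stata executable, the sketch below uses the RStata options `RStata.StataPath` and `RStata.StataVersion`. The path and the version number are assumptions made for illustration, and main.Rmd may configure this differently, so adjust both to match your own installation.

```r
library(RStata)

# Hypothetical path: replace with the Stata executable on your machine.
stata_path <- "/usr/local/stata13/stata-se"

# RStata reads the executable location and the Stata version from these options.
options("RStata.StataPath" = stata_path)
options("RStata.StataVersion" = 13)  # assumed version; set to match your install

# Run the survey script only if an executable exists at that path.
if (file.exists(stata_path)) {
  stata("survey_analysis.do")
}
```

On Windows the path typically needs to point at the .exe file inside the Stata installation folder.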

For both news.rds and tweets.rds, the full collected data were already run through the following dictionary in R and filtered to rows with covid > 0, that is, rows with at least one match on the covid key. If you are interested in the unfiltered dataset, please reach out.

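library(dplyr)     # %>%, select, mutate, group_split, bind_rows, distinct, sample_n
library(quanteda)  # dictionary, corpus, tokens, tokens_lookup, dfm, convert, docvars

misinfo_dictionary <- quanteda::dictionary(list(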
  covid = c('coronavirus','covid','covid-19'),
  flu = c('flu', 'harmless'),
  bat = c('bat','bats'),
  vit = c('vitamin-c','vitamin c'),
  hand = c('hand wash', 'wash hands', 'hand washing', '30 seconds', 'thirty seconds', 'with soap','wash your hands','hand sanitizer',
           'avoid touching your eyes', 'avoid touching your face','avoid touching your mouth'),
  soc = c('work from home','stay home', 'stay at home','avoid all non-essential trips','gather in groups',
          'avoid places', 'avoid public', 'avoid crowds','avoid gatherings','grocery delivery',
          'avoid large gatherings','avoid small gatherings','and small gatherings','and large gatherings',
          'limit events','limit meetings','self-isolate','isolation','must isolate', 'social distance','social distancing','six feet','6 ft', '2 meter', '2 meters', 'maintain distance'),
  qua = c('quarantine','14 days','travel outside of Canada'),
  motta = c('made in a lab', 'big pharma', 'george soros', 'hoax', 'conspiracy', 'bioweapon', 'not real', 'existing vaccine'),
  conspiracy = c('Hoax', 'fraud', 'deception', 'swindle', 'dupe', 'con', 'trick', 'deceive',
                 'scam', 'scheme', 'racket', 'overblown', 'exaggerated', 'overdone', 'inflated',
                 'embellished', 'hyperbolic', 'conspiracy','hyperbole', 'harmless')
))

apply_dictionary <- function(df, dict_to_apply = misinfo_dictionary) {

  i <<- i + 1

  if (i %% 100 == 1) {
    print(paste0("Now on iteration...",i,"."))
  }

  df_corp = df %>%
    select(docid_field = status_id, text) %>%
    corpus()

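  # Count dictionary matches per tweet and append the counts to the input rows.
  # convert() names its document-ID column "document" or "doc_id" depending on
  # the quanteda version, so drop whichever one is present.
  df = df_corp %>%
    tokens() %>%
    tokens_lookup(dictionary = dict_to_apply) %>%
    dfm() %>%
    convert(to = "data.frame") %>%
    cbind(docvars(df_corp)) %>%
    select(-any_of(c("document", "doc_id"))) %>%
    cbind(df, .)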

  return(df)

}

apply_to_tweets = function(full_df, dict_to_apply = misinfo_dictionary) {

  start = Sys.time()
  i <<- 0
  # Split into ceiling(n / 1000) interleaved chunks of roughly 1000 tweets each
  classified = full_df %>%
    mutate(row_id = row_number(), group = row_id %% ceiling(nrow(full_df)/1000)) %>%
    group_split(group) %>%
    lapply(., apply_dictionary, dict_to_apply = dict_to_apply) %>%
    bind_rows()

  print(paste0("For ", nrow(full_df), " tweets the function took ", format(round(Sys.time() - start, 2)), "."))

  return(classified)

}

tweets_classified <- apply_to_tweets(tweets)

# Random selection of tweets for manual coding

unique_texts = tweets_classified %>% distinct(text, flu, bat, vit, conspiracy)  # dictionary counts come from apply_to_tweets() above

set.seed(0); flu = unique_texts %>% filter(flu > 0) %>% sample_n(500)
set.seed(0); bat = unique_texts %>% filter(bat > 0) %>% sample_n(500)
set.seed(0); vit = unique_texts %>% filter(vit > 0) %>% sample_n(500)
set.seed(0); con = unique_texts %>% filter(conspiracy > 0) %>% sample_n(500)
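
To make the dictionary step and the covid > 0 filter described above more concrete, here is a minimal sketch on toy data. The two-key dictionary is an abbreviated stand-in for misinfo_dictionary, and the three tweets and the names `toy_dict`, `toy`, `counts`, and `kept` are invented for illustration. It is not the authors' pipeline, only the same quanteda calls applied at small scale.

```r
library(quanteda)

# Toy dictionary: two keys copied from misinfo_dictionary, for brevity.
toy_dict <- dictionary(list(
  covid = c("coronavirus", "covid", "covid-19"),
  vit   = c("vitamin-c", "vitamin c")
))

# Hypothetical tweets, invented for this example.
toy <- data.frame(
  status_id = c("a1", "a2", "a3"),
  text = c("Vitamin C cures covid",
           "Remember to wash your hands",
           "Coronavirus update for today"),
  stringsAsFactors = FALSE
)

# Build a corpus, replace tokens with the dictionary keys they match,
# and tabulate the number of matches per key for each document.
toy_corp <- corpus(toy, docid_field = "status_id", text_field = "text")
toy_toks <- tokens_lookup(tokens(toy_corp), dictionary = toy_dict)
counts   <- convert(dfm(toy_toks), to = "data.frame")

# Keep only documents with at least one match on the covid key.
kept <- counts[counts$covid > 0, ]
print(kept)
```

The second toy tweet matches no covid term, so it is dropped by the filter, which mirrors how the shared news.rds and tweets.rds files were reduced from the full collection.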
